Skip to content

[PoC] Push simple filter conditions into $lookup stage. #345

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

WaVEV
Copy link
Collaborator

@WaVEV WaVEV commented Jul 18, 2025

In this PR we tackle the slow join pushing the simple conditions inside the $lookup. the condition is also fix a bug.

New query:

[{'$lookup': {'as': 'm2m_regress_selfrefer_references',
              'from': 'm2m_regress_selfrefer_references',
              'let': {'parent__field__0': '$_id'},
              'pipeline': [{'$match': {'$expr': {'$and': [{'$eq': ['$$parent__field__0',
                                                                   '$to_selfrefer_id']},
                                                          {'$eq': ['$from_selfrefer_id',
                                                                   ObjectId('687ad601daccf7b757ff46ec')]}]}}}]}},
 {'$unwind': '$m2m_regress_selfrefer_references'},
 {'$match': {'$expr': {'$eq': ['$m2m_regress_selfrefer_references.from_selfrefer_id',
                               ObjectId('687ad601daccf7b757ff46ec')]}}}]

Previous query

[{'$lookup': {'as': 'm2m_regress_selfrefer_references',
              'from': 'm2m_regress_selfrefer_references',
              'let': {'parent__field__0': '$_id'},
              'pipeline': [{'$match': {'$expr': {'$and': [{'$eq': ['$$parent__field__0',
                                                                   '$to_selfrefer_id']}]}}}]}},
 {'$unwind': '$m2m_regress_selfrefer_references'},
 {'$match': {'$expr': {'$eq': ['$m2m_regress_selfrefer_references.from_selfrefer_id',
                               ObjectId('687ad676396a5de149f2ea4d')]}}}]

@WaVEV WaVEV marked this pull request as ready for review July 18, 2025 23:21
@WaVEV WaVEV changed the title Push simple filter conditions into $lookup stage. [PoC] Push simple filter conditions into $lookup stage. Jul 18, 2025
@WaVEV WaVEV requested review from timgraham and Jibola July 19, 2025 18:33
@WaVEV WaVEV force-pushed the INTPYTHON-635-Slow-joins-for-ManyToMany-fields branch from 129893f to 3e685a8 Compare July 21, 2025 19:29
@timgraham
Copy link
Collaborator

Does it address #309? Do we need some additional tests for other cases besides the changed test?

I'll have to take it on faith that this does the right thing, because it would take me a lot of time to understand this code. Some comments explaining variable names like promote_filters, pushed_expressions, and rerooted_replacement as well as what cases this covers (what are "simple cases"?) would be helpful.

@WaVEV
Copy link
Collaborator Author

WaVEV commented Jul 22, 2025

Does it address #309? Do we need some additional tests for other cases besides the changed test?

I'll have to take it on faith that this does the right thing, because it would take me a lot of time to understand this code. Some comments explaining variable names like promote_filters, pushed_expressions, and rerooted_replacement as well as what cases this covers (what are "simple cases"?) would be helpful.

It fixes #309, but there’s room for improvement. It tries to push some conditions into the $lookup. Right now, it only pushes the simplest ones, the top level conditions in the WhereNode that match the nested $lookup collection.

Could it be better? Yes. We could create a WhereNode instead of just a list of conditions. We could also walk the condition tree semantically to push more than just the obvious ones. But we have to be careful. Operators like not and or can't be pushed directly.

Should it have more tests? Maybe, but we didn’t add extra tests for previous join strategies either. It passes the existing Django join tests.

EDIT: On the other hand, this is a PoC. If it seems to work in the performance test, it could be implemented.

@WaVEV WaVEV force-pushed the INTPYTHON-635-Slow-joins-for-ManyToMany-fields branch from 3e685a8 to 764cd51 Compare July 22, 2025 23:39
Copy link
Contributor

@Jibola Jibola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with the current PoC addition without tests. However, if we can get a proper design of the optimization outside of this patch-fix, I'd say then we need to write out a proper test suite.

Comment on lines +558 to +562
if (
isinstance(expr, Lookup)
and isinstance(expr.lhs, Col)
and (is_direct_value(expr.rhs) or isinstance(expr.rhs, Value))
):
promote_filters[expr.lhs.alias].append(expr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of scope for this PR, but we can definitely have two options:

  • If it's a direct_value

Directly promote the filter

  • If it's a sub-expression

convert the sub-expression and then place it in the promote_filters

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the idea.
Now the conditions are pushed up (not sure if that's the right term, but I mean passing the filter into the $lookup) only if the Django lookup looks like one of the following:

  1. Column operator value
  2. Column operator Column ← this one might actually be broken in the current code 😬

What the code does now is:
It filters out all conditions that involve having, subqueries, or composed conditions like A and (B or C).

Why?
For subqueries, we need to promote or isolate certain $lookups in the pipeline.
For composed conditions, we need to analyze the expression tree to determine which parts can be pushed up. For example, in A and (B or C), if A, B, and C are atomic and B and C can filter the outer collection, then we can push up the filter B or C.

@@ -165,18 +152,47 @@ def join(self, compiler, connection):
# based on their rerouted positions in the join pipeline.
replacements = {}
for col, parent_pos in columns:
column_target = Col(compiler.collection_name, expr.output_field.__class__())
target = col.target.clone()
target.remote_field = col.target.remote_field
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not cloned from clone()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to answer yes, but the ManyToMany field isn't well cloned, because of some initialization, so I have to copy the remote_field.

Comment on lines +167 to +180
lookup_pipeline = []
lhs_fields = []
rhs_fields = []
# Add a join condition for each pair of joining fields.
parent_template = "parent__field__"
for lhs, rhs in self.join_fields:
lhs, rhs = connection.ops.prepare_join_on_clause(
self.parent_alias, lhs, compiler.collection_name, rhs
)
lhs_fields.append(lhs.as_mql(compiler, connection))
# In the lookup stage, the reference to this column doesn't include
# the collection name.
rhs_fields.append(rhs.as_mql(compiler, connection))
# Handle any join conditions besides matching field pairs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the portion of code that generates the code shown below?

{
  $lookup: {
  ...
  $pipeline: [
  ...
      { $and: [
          ...,
          {$eq: ['$$parent_field_0',  '$_id']}  # This line?
          ]
       }
  ]
  }
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is the same code as before, I only created a function because the process applied to extra is applied to the new filters

Comment on lines +183 to 187
if extra:
replacements = _get_reroot_replacements(extra.leaves())
extra_condition = [extra.replace_expressions(replacements).as_mql(compiler, connection)]
else:
extra_condition = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming these are the expressions that we aren't able to easily convert yet.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that was old code, some lookups have extra conditions and others haven't , I did nothing new with extra. So if no extra, no extra conditions 😄

@WaVEV WaVEV closed this Jul 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants